Abstract

This paper deals with the complexity of strings, which play an important role in biology (nucleotide sequences), information theory and computer science [1,2,4]. The $d$-complexity of a string is defined as the number of its distinct $d$-substrings, given in Definition 1. The case $d = 1$ is studied in detail.

1 Introduction

Let $X$ be an alphabet, and $X^k$ the set of all strings of length $k$ over $X$. The $i$ consecutive appearances of a letter $a$ in a string will be denoted by $a^i$. If $i = 0$ then this means the absence of the corresponding letter. The definitions are from [2].

Definition 1 Let $d$, $k$ and $s$ be positive integers, and $p = x_1 x_2 \cdots x_k \in X^k$. A $d$-substring of $p$ is defined as $q = x_{i_1} x_{i_2} \cdots x_{i_s}$, where

$$i_1 \ge 1,$$
$$1 \le i_{j+1} - i_j \le d, \quad \text{for } j = 1, 2, \ldots, s-1,$$
$$i_s \le k.$$

Definition 2 The $d$-complexity $K_d(p)$ of the string $p$ is the number of all distinct $d$-substrings of $p$.

Example. Let $X$ be the English alphabet and $p$ = ISIS. In this string there are two 2-substrings of length 1 (I, S), four 2-substrings of length 2 (IS, II, SI, SS), four 2-substrings of length 3 (ISI, ISS, IIS, SIS), and a single one of length 4 (ISIS). Then $K_2(p) = 2 + 4 + 4 + 1 = 11$. ■
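Definitions 1 and 2 can be checked mechanically on short strings. The following Python sketch enumerates the admissible index sequences directly; the function names are illustrative and do not come from the paper, and the enumeration takes exponential time, so it is meant only for small examples such as ISIS.

```python
def d_substrings(p, d):
    """Return the set of distinct d-substrings of p (Definition 1)."""
    found = set()
    # Each stack entry: (index of the last chosen letter, substring so far).
    stack = [(i, p[i]) for i in range(len(p))]
    while stack:
        last, q = stack.pop()
        found.add(q)
        # The next chosen index may exceed the last one by 1, 2, ..., d.
        for nxt in range(last + 1, min(last + d, len(p) - 1) + 1):
            stack.append((nxt, q + p[nxt]))
    return found


def d_complexity(p, d):
    """Return K_d(p), the number of distinct d-substrings (Definition 2)."""
    return len(d_substrings(p, d))


print(d_complexity("ISIS", 2))  # 11, as in the example
```

With $d = 1$ the same function counts ordinary contiguous substrings.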

In the case of strings of length $k$ consisting of different symbols, the $d$-complexity will be denoted by $N(k,d)$. For any $k \ge 1$ and $p \in X^k$ we have $k \le K_1(p) \le \frac{k(k+1)}{2}$. If $|X| \ge 2$, $k \ge 1$, $d \ge 1$ and $p \in X^k$, then $k \le K_d(p) \le 2^k - 1$. If $p$ is a string consisting of different symbols and $d$ is a positive integer, then $a_{i,d}(p)$ will denote the number of $d$-substrings of $p$ which terminate in position $i$. If $k \ge 1$ and $p \in X^k$ consists of different symbols, then for $i = 1, 2, \ldots, k$

$$a_{i,d}(p) = 1 + a_{i-1,d}(p) + a_{i-2,d}(p) + \cdots + a_{i-d,d}(p), \qquad (1)$$

where $a_{j,d}(p) = 0$ for $j \le 0$.

2 Computing the value of N(k,d) 

The $d$-complexity of a string with different symbols can be obtained by the formula

$$N(k,d) = \sum_{i=1}^{k} a_{i,d}(p),$$

where $p$ is any string of $k$ different symbols. Because of (1) we can write in the case of $d \ge 2$

$$a_{i,d} + \frac{1}{d-1} = \left(a_{i-1,d} + \frac{1}{d-1}\right) + \cdots + \left(a_{i-d,d} + \frac{1}{d-1}\right).$$

Let

$$b_{i,d} = a_{i,d} + \frac{1}{d-1} \quad \text{and} \quad c_{i,d} = (d-1)\, b_{i,d};$$

then

$$c_{i,d} = c_{i-1,d} + c_{i-2,d} + \cdots + c_{i-d,d},$$

and the sequence $c_{i,d}$ is one of Fibonacci type. For any $d$ we have $a_{1,d} = 1$, and from this $c_{1,d} = d$ results. Therefore the numbers $c_{i,d}$ are defined by the following recurrence equations:

$$c_{n,d} = c_{n-1,d} + c_{n-2,d} + \cdots + c_{n-d,d} \quad \text{for } n > 0,$$
$$c_{n,d} = 1 \quad \text{for } n \le 0.$$

These numbers can be generated by the following generating function:

$$F_d(z) = \sum_{n \ge 0} c_{n,d}\, z^n = \frac{1 + (d-2)z - z^2 - \cdots - z^d}{1 - 2z + z^{d+1}} = \frac{1 + (d-3)z - (d-1)z^2 + z^{d+1}}{(1-z)(1 - 2z + z^{d+1})}.$$

The $d$-complexity $N(k,d)$ can be expressed with these numbers $c_{n,d}$ by the following formula:

$$N(k,d) = \frac{1}{d-1}\left(\sum_{i=1}^{k} c_{i,d} - k\right), \quad \text{for } d > 1,$$

and

$$N(k,1) = \frac{k(k+1)}{2},$$

or

$$N(k,d) = N(k-1,d) + \frac{1}{d-1}\left(c_{k,d} - 1\right), \quad \text{for } d > 1,\ k > 1.$$
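The formulas for $c_{n,d}$ and $N(k,d)$ translate directly into a short computation. The Python sketch below is one possible way to evaluate them; the names `c_sequence` and `N` are chosen here for illustration only.

```python
def c_sequence(k, d):
    """Return [c_0, c_1, ..., c_k], using c_n = 1 for every n <= 0."""
    c = [1]  # c_0
    for n in range(1, k + 1):
        total = 0
        for j in range(1, d + 1):
            # Terms with a negative index are equal to 1 by definition.
            total += c[n - j] if n - j >= 0 else 1
        c.append(total)
    return c


def N(k, d):
    """d-complexity of a string of k different symbols."""
    if d == 1:
        return k * (k + 1) // 2
    c = c_sequence(k, d)
    # c_i - 1 = (d - 1) * a_i, so the division below is exact.
    return (sum(c[1:]) - k) // (d - 1)


print(N(10, 2), N(10, 4))  # 364 894
```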
If $d = 2$ then

$$F_2(z) = \frac{1 - z^2}{1 - 2z + z^3} = \frac{1 + z}{1 - z - z^2} = \frac{F(z)}{z} + F(z),$$

where $F(z)$ is the generating function of the Fibonacci numbers $F_n$ (with $F_0 = 0$, $F_1 = 1$). Then, from this formula we have

$$c_{n,2} = F_{n+1} + F_n = F_{n+2}$$

and

$$N(k,2) = \sum_{i=1}^{k} F_{i+2} - k = F_{k+4} - k - 3.$$

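Taking into account the formula for $F_n$ we have

$$N(k,2) = \left\lfloor \frac{1}{\sqrt{5}} \left(\frac{1+\sqrt{5}}{2}\right)^{k+4} + \frac{1}{2} \right\rfloor - k - 3,$$

which can be approximated by

$$\left\lfloor 3.0652475 \cdot (1.6180339)^k + 0.5 \right\rfloor - k - 3.$$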
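The three expressions for $N(k,2)$ can be compared numerically. The following Python sketch does so under the assumption of ordinary double-precision arithmetic; the function names are illustrative, and the version with truncated constants should be trusted only for moderate $k$, since the rounding error grows with $k$.

```python
import math


def fib(n):
    """Fibonacci number F_n with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def n2_exact(k):
    """N(k, 2) = F_{k+4} - k - 3."""
    return fib(k + 4) - k - 3


def n2_rounded(k):
    """Same value via the rounded closed form for F_{k+4}."""
    phi = (1 + math.sqrt(5)) / 2
    return math.floor(phi ** (k + 4) / math.sqrt(5) + 0.5) - k - 3


def n2_approx(k):
    """Same value via the truncated decimal constants."""
    return math.floor(3.0652475 * 1.6180339 ** k + 0.5) - k - 3


print(n2_exact(10), n2_rounded(10), n2_approx(10))  # 364 364 364
```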
Table 1 lists the values of $N(k,d)$ for $k \le 10$ and $d \le 10$.

k \ d      1      2      3      4      5      6      7      8      9     10
    1      1      1      1      1      1      1      1      1      1      1
    2      3      3      3      3      3      3      3      3      3      3
    3      6      7      7      7      7      7      7      7      7      7
    4     10     14     15     15     15     15     15     15     15     15
    5     15     26     30     31     31     31     31     31     31     31
    6     21     46     58     62     63     63     63     63     63     63
    7     28     79    110    122    126    127    127    127    127    127
    8     36    133    206    238    250    254    255    255    255    255
    9     45    221    383    462    494    506    510    511    511    511
   10     55    364    709    894    974   1006   1018   1022   1023   1023

Table 1






From the definition of the $d$-substrings it follows that

$$N(k,d) = N(k,d+1), \quad \text{for } d \ge k-1,$$

but

$$N(k,k-1) = 2^k - 1,$$

and then

$$N(k,d) = 2^k - 1, \quad \text{for any } d \ge k-1.$$

The following proposition gives the value of $N(k,d)$ in almost all cases:

Proposition 1 [3]. For $k \ge 2d - 2$ we have

$$N(k,k-d) = 2^k - (d-2)\cdot 2^{d-1} - 2.$$

The main step in the proof is based on the formula

$$N(k,k-d-1) = N(k,k-d) - d \cdot 2^{d-1}.$$
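Proposition 1 can be compared with recurrence (1) for small parameters. The Python sketch below is only a numerical sanity check, not a proof; the function names are illustrative.

```python
def n_by_recurrence(k, d):
    """N(k, d) as the sum of a_{1,d}, ..., a_{k,d} from recurrence (1)."""
    a = []
    for i in range(k):
        # Each term is 1 plus the sum of the previous d terms;
        # terms before the start of the list count as 0.
        a.append(1 + sum(a[max(0, i - d):i]))
    return sum(a)


def n_closed_form(k, d):
    """Right-hand side of Proposition 1: the claimed value of N(k, k - d)."""
    return 2 ** k - (d - 2) * 2 ** (d - 1) - 2


print(n_by_recurrence(10, 6), n_closed_form(10, 4))  # 1006 1006
```

Outside the range $k \ge 2d - 2$ the two values may differ; for example, for $k = 5$ and $d = 4$ the recurrence gives $N(5,1) = 15$ while the closed form gives 14.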

The value of $N(k,d)$ can also be obtained by computing the number of sequences of length $k$ of 0's and 1's that contain at least one 1 and in which no more than $d-1$ adjacent zeros occur between two 1's. In such a sequence a 1 represents the presence and a 0 the absence of a letter of the string in a given $d$-substring. Let $b_{k,d}$ denote the number of $k$-length sequences of zeros and ones in which the first and last position is 1, and the number of adjacent zeros is at most $d-1$. Then it can easily be proved that

$$b_{k,d} = b_{k-1,d} + b_{k-2,d} + \cdots + b_{k-d,d}, \quad \text{for } k > 1,$$
$$b_{1,d} = 1,$$
$$b_{k,d} = 0, \quad \text{for all } k \le 0,$$

because any such sequence of length $k-i$ ($i = 1, 2, \ldots, d$) can be continued in order to obtain a similar sequence of length $k$ in only one way (by adding a sequence of the form $0^{i-1}1$ on the right). For $b_{k,d}$ the following formula can also be derived:

$$b_{k,d} = 2b_{k-1,d} - b_{k-1-d,d}, \quad \text{for } k > 2.$$

If we add one 1 or 0 in an internal position (e.g. in the $(k-2)$th) of each of the $b_{k-1,d}$ sequences, then we obtain $2b_{k-1,d}$ sequences of length $k$, but among these $b_{k-1-d,d}$ sequences will have $d$ adjacent zeros.

The generating function corresponding to $b_{n,d}$ is

$$B_d(z) = \sum_{n \ge 0} b_{n,d}\, z^n = \frac{z}{1 - z - z^2 - \cdots - z^d} = \frac{z(1-z)}{1 - 2z + z^{d+1}}.$$

Adding zeros on the left and/or on the right to these sequences, we can obtain the number $N(k,d)$ as the number of all these sequences. Thus

$$N(k,d) = b_{k,d} + 2b_{k-1,d} + 3b_{k-2,d} + \cdots + k\, b_{1,d}$$

($i$ zeros can be added in $i+1$ ways to these sequences: 0 on the left and $i$ on the right, 1 on the left and $i-1$ on the right, and so on).
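The correspondence between $d$-substrings and 0/1 sequences can be tested by brute force for small $k$. In the Python sketch below, `n_from_b` evaluates the weighted sum of the $b_{j,d}$, while `n_by_enumeration` counts the admissible sequences directly; both names are chosen here for illustration.

```python
from itertools import product


def b_sequence(k, d):
    """Return [b_0, b_1, ..., b_k] for the given d (b_0 = 0, b_1 = 1)."""
    b = [0] * (k + 1)
    if k >= 1:
        b[1] = 1
    for n in range(2, k + 1):
        # Sum of the previous d terms; terms with index <= 0 are 0.
        b[n] = sum(b[max(0, n - d):n])
    return b


def n_from_b(k, d):
    """N(k, d) = b_k + 2 b_{k-1} + 3 b_{k-2} + ... + k b_1."""
    b = b_sequence(k, d)
    return sum((k - j + 1) * b[j] for j in range(1, k + 1))


def n_by_enumeration(k, d):
    """Count 0/1 sequences with a 1 and fewer than d zeros between 1's."""
    count = 0
    for bits in product("01", repeat=k):
        core = "".join(bits).strip("0")  # drop leading and trailing zeros
        if core and "0" * d not in core:
            count += 1
    return count


print(n_from_b(6, 3), n_by_enumeration(6, 3))  # 58 58
```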

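From the above formula, the generating function corresponding to the complexities $N(k,d)$ can be obtained from the product of the two generating functions $B_d(z)$ and $A(z) = \sum_{n \ge 0} n z^n = \frac{z}{(1-z)^2}$. The coefficient of $z^{n+1}$ in this product is $N(n,d)$, thus:

$$\sum_{n \ge 0} N(n,d)\, z^n = \frac{B_d(z)\, A(z)}{z} = \frac{z}{(1-z)(1 - 2z + z^{d+1})}.$$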



3 The 1-complexity 

We shall use the term complexity instead of the 1-complexity and the notation $K(p)$ instead of $K_1(p)$. A $k$-length string $p$ over an $n$-letter alphabet has maximal complexity if

$$K(p) = \sum_{i=1}^{k} \min(n^i,\ k - i + 1).$$

In the following we give some results which can be proved immediately (in all cases $p \in X^k$).

a) $k \le K(p) \le \frac{k(k+1)}{2}$.

b) For a trivial string $p = a^k$, $K(p) = k$.

c) If $x_k \ne x_i$ for $i = 1, 2, \ldots, k-1$, then

$$K(x_1 x_2 \cdots x_k) = k + K(x_1 x_2 \cdots x_{k-1}).$$

d) If $p$ is not a trivial string, then $2k - 1 \le K(p) \le \frac{k(k+1)}{2}$.

e) If $p = a^{k-i} b a^{i-1}$ for a fixed $i$ ($1 \le i \le \lfloor k/2 \rfloor$), then

$$K(p) = (i+1)k - i^2.$$

f) If $p$ has at least $t$ different letters, then $K(p) \ge kt - \frac{t(t-1)}{2}$.

(For the string $a_1 a_2 \cdots a_{t-1} b^{k-t+1}$ with $a_i \ne a_j$ for $i \ne j$, and $a_i \ne b$, we have equality in the above formula.)

g) If $p \in X^k$, $q \in Y^m$ and $X \cap Y = \emptyset$, then

$$K(pq) = K(p) + K(q) + km.$$

h) If $p$ has only different letters, then

$$K(p) = \frac{k(k+1)}{2},$$
$$K(pp^R) = 2k^2, \quad \text{where } p^R \text{ is the reverse string of } p,$$
$$K(p^n) = \frac{k(k+1)}{2} + (n-1)k^2, \quad \text{where } p^n \text{ is } p \text{ concatenated } n \text{ times}.$$

i) $K(x_1 x_2 \cdots x_k x_1 x_2 \cdots x_n) = \frac{k(k+1)}{2} + nk$ for $1 \le n \le k$, $x_i \ne x_j$ for $i \ne j$.
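Several of the properties above can be spot-checked by counting substrings directly. The Python sketch below does this for one string length; the helper names are illustrative, and passing checks for a few values of $k$ is of course not a proof.

```python
def complexity(p):
    """Number of distinct nonempty contiguous substrings of p."""
    n = len(p)
    return len({p[i:j] for i in range(n) for j in range(i + 1, n + 1)})


def check_properties(k):
    """Check properties b), e), f), g), h) and i) for one length k <= 26."""
    p = "abcdefghijklmnopqrstuvwxyz"[:k]  # k different letters
    full = k * (k + 1) // 2
    assert complexity("a" * k) == k                                  # b)
    for i in range(1, k // 2 + 1):                                   # e)
        s = "a" * (k - i) + "b" + "a" * (i - 1)
        assert complexity(s) == (i + 1) * k - i * i
    t = 3                                                            # f), equality
    assert complexity("ab" + "c" * (k - t + 1)) == k * t - t * (t - 1) // 2
    assert complexity("aab" + "cdc") == complexity("aab") + complexity("cdc") + 9  # g)
    assert complexity(p) == full                                     # h)
    assert complexity(p + p[::-1]) == 2 * k * k
    assert complexity(p * 3) == full + 2 * k * k
    for n in range(1, k + 1):                                        # i)
        assert complexity(p + p[:n]) == full + n * k
    return True


print(check_properties(7))  # True
```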

There arise the following two problems: 

1. Find a minimal length string with a given complexity. 

This problem always has a solution. (If the complexity is $C$, then in the worst case the string consisting of $C$ identical letters represents a trivial solution.)

2. Find a k-length string with a given complexity, if it exists. 

These problems can be solved by a branch-and-bound-type algorithm. We shall construct a tree in which each node is a string. The root is a letter of the alphabet. Each node (i.e. each string) is obtained from its parent node by adding a new letter of the alphabet. The construction will be continued at a node if its complexity is less than the given complexity and, in the case of the second problem, only if its length is also less than $k$. This algorithm can be improved by omitting some branches which do not produce essentially new strings; e.g. if we have a four-letter alphabet, then the strings abd and abc are isomorphic (they differ only in some letters, but not in form). This can be given by the following recursive algorithm. Let $a_1, a_2, \ldots, a_n$ be the letters of the alphabet, $k$ the length and $C$ the desired complexity. The symbol "+" will denote the concatenation of a string with a letter, and $|w|$ the length of the string $w$. The algorithm starts with generate("$a_1$").

generate(w):
    if complexity(w) < C and |w| < k
        then for i := 1, 2, ..., n do generate(w + "a_i")
        else if complexity(w) = C and |w| = k then write(w).

Of course, if $C < k$ or $C > k(k+1)/2$, or if no string with the desired complexity and length exists, then this algorithm produces nothing. To solve the first problem, we omit the restriction on length in the above algorithm.
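The recursive procedure can be sketched in Python as follows. This is one possible reading of the algorithm, not the program used for the paper: the function name `find_strings` and the `canonical_only` switch are assumptions made here, and the pruning of isomorphic strings is implemented by allowing only the letters already used plus the first unused one.

```python
def complexity(w):
    """Number of distinct nonempty contiguous substrings of w."""
    n = len(w)
    return len({w[i:j] for i in range(n) for j in range(i + 1, n + 1)})


def find_strings(alphabet, k, C, canonical_only=True):
    """Return the strings of length k and complexity C reached by the search."""
    results = []

    def generate(w):
        if complexity(w) < C and len(w) < k:
            if canonical_only:
                # Letters appear in order of first use, so only one
                # representative of each class of isomorphic strings is built.
                limit = min(len(set(w)) + 1, len(alphabet))
            else:
                limit = len(alphabet)
            for letter in alphabet[:limit]:
                generate(w + letter)
        elif complexity(w) == C and len(w) == k:
            results.append(w)

    generate(alphabet[0])  # the search starts with the first letter
    return results


print(find_strings("abc", 3, 5))  # ['aab', 'aba', 'abb']
```

Each of the three patterns printed above stands for $3 \cdot 2 = 6$ strings over a three-letter alphabet, obtained by renaming the letters.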

Since there is always a string with a given complexity, the question is whether there exists a nontrivial string with a given complexity. (A nontrivial string contains at least two different letters.) The answer is yes, except in some cases.

Proposition 2 If C is a natural number different from 1, 2 and 4, then there exists 
a nontrivial string of complexity equal to C. 

Proof. To prove this proposition we give the complexity of the following $k$-length strings:

$$K(a^{k-1}b) = 2k - 1 \quad \text{for } k \ge 1,$$
$$K(ab^{k-3}aa) = 4k - 8 \quad \text{for } k \ge 4,$$
$$K(abcd^{k-3}) = 4k - 6 \quad \text{for } k \ge 3.$$

These can be proved immediately from the definition of the complexity.

1. If $C$ is odd then we can write $C = 2k - 1$ for a given $k$. From this $k = (C+1)/2$ results, and the string $a^{k-1}b$ has complexity $C$.

2. If $C$ is even, then $C = 2\ell$.

2.1. If $\ell = 2h$, then $4k - 8 = C$ gives $4k - 8 = 4h$, and from this $k = h + 2$ results. The string $ab^{k-3}aa$ has complexity $C$.

2.2. If $\ell = 2h + 1$, then $4k - 6 = C$ gives $4k - 6 = 4h + 2$, and from this $k = h + 2$ results. The string $abcd^{k-3}$ has complexity $C$. ■
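The proof is constructive, so it can be turned into a small routine. The Python sketch below follows the three cases of the proof; the function name `nontrivial_string` is chosen here for illustration, and the brute-force `complexity` is included only to confirm the result.

```python
def complexity(p):
    """Number of distinct nonempty contiguous substrings of p."""
    n = len(p)
    return len({p[i:j] for i in range(n) for j in range(i + 1, n + 1)})


def nontrivial_string(C):
    """Return a nontrivial string of complexity C, following the proof."""
    if C < 3 or C == 4:
        raise ValueError("no nontrivial string has complexity 1, 2 or 4")
    if C % 2 == 1:                 # case 1: C = 2k - 1
        k = (C + 1) // 2
        return "a" * (k - 1) + "b"
    if C % 4 == 0:                 # case 2.1: C = 4h, k = h + 2
        k = C // 4 + 2
        return "a" + "b" * (k - 3) + "aa"
    k = (C + 6) // 4               # case 2.2: C = 4h + 2, k = h + 2
    return "abc" + "d" * (k - 3)


for C in (3, 6, 8, 9):
    print(C, nontrivial_string(C))
```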

In the proof we have used more than two letters in a string only in the case of the numbers of the form $4h + 2$ (case 2.2 above). The new question is whether there always exist nontrivial strings formed only of two letters with a given complexity. The answer is again yes, apart from a few exceptions. We must prove this only for the numbers of the form $4h + 2$. If $C = 4h + 2$ and $C \ge 34$, we use the following:

$$K(ab^{k-7}abbabb) = 8k - 46, \quad \text{for } k \ge 10,$$
$$K(ab^{k-7}ababba) = 8k - 42, \quad \text{for } k \ge 10.$$

If $h = 2s$, then $8k - 46 = 4h + 2$ gives $k = s + 6$, and the string $ab^{k-7}abbabb$ has complexity $4h + 2$.

If $h = 2s + 1$, then $8k - 42 = 4h + 2$ gives $k = s + 6$, and the string $ab^{k-7}ababba$ has complexity $4h + 2$. For $C < 34$ only 14, 26 and 30 are feasible. The string $ab^4a$ has complexity 14, $ab^8a$ complexity 26, and $ab^5aba$ complexity 30. It can easily be proved, using a tree like in the above algorithm, that for 6, 10, 18 and 22 such strings do not exist. Then the following is true.

Proposition 3 If $C$ is a natural number different from 1, 2, 4, 6, 10, 18 and 22, then there exists a nontrivial string formed only of two letters, with the given complexity $C$.
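For the numbers of the form $4h + 2$, the two-letter construction can be sketched as follows. The Python code below covers only this residue class; the function name and the two lookup tables are illustrative, and the brute-force `complexity` serves as a check.

```python
def complexity(p):
    """Number of distinct nonempty contiguous substrings of p."""
    n = len(p)
    return len({p[i:j] for i in range(n) for j in range(i + 1, n + 1)})


# Complexities below 34 that need a separately chosen string.
SMALL_CASES = {14: "abbbba", 26: "a" + "b" * 8 + "a", 30: "abbbbbaba"}
# Values of the form 4h + 2 excluded by Proposition 3.
IMPOSSIBLE = {2, 6, 10, 18, 22}


def two_letter_string(C):
    """Two-letter string of complexity C, for C of the form 4h + 2."""
    if C < 2 or C % 4 != 2 or C in IMPOSSIBLE:
        raise ValueError("no construction for this C")
    if C in SMALL_CASES:
        return SMALL_CASES[C]
    h = (C - 2) // 4               # here C >= 34, so h >= 8
    s, odd = divmod(h, 2)
    k = s + 6
    tail = "ababba" if odd else "abbabb"
    return "a" + "b" * (k - 7) + tail


print(two_letter_string(34), two_letter_string(38))
```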

In relation to the second problem a new one arises: how many strings of length $k$ and complexity $C$ exist? For small $k$ this problem can be studied exhaustively. Let $X$ consist of $k$ letters, and let us consider all strings of length $k$ over $X$. By a computer program we have obtained Table 2, which contains the frequency of strings with a given length and complexity.

length=2
complexity    2    3
frequency     2    2

length=3
complexity    3    4    5    6
frequency     3    0   18    6

length=4
complexity    4    5    6    7    8    9   10
frequency     4    0    0   36   48  144   24

length=5
complexity    5    6    7    8    9   10   11   12    13    14   15
frequency     5    0    0    0   60    0  200  400  1140  1200  120

length=6
complexity    6    7    8    9   10   11   12   13
frequency     6    0    0    0    0   90    0    0

complexity   14   15   16    17    18     19     20   21
frequency   300  990  270  5400  8280  19800  10800  720

Table 2.
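A table of this kind can be regenerated by exhaustive enumeration. The Python sketch below is not the original program; it simply counts, for a $k$-letter alphabet, how many of the $k^k$ strings of length $k$ have each complexity, and omits the complexities with frequency 0.

```python
from collections import Counter
from itertools import product


def complexity(p):
    """Number of distinct nonempty contiguous substrings of p."""
    n = len(p)
    return len({p[i:j] for i in range(n) for j in range(i + 1, n + 1)})


def frequencies(k):
    """Map each complexity C to the number of k-length strings having it."""
    letters = "abcdefghijklmnopqrstuvwxyz"[:k]
    return Counter(complexity("".join(w)) for w in product(letters, repeat=k))


for k in (2, 3, 4):
    print(k, sorted(frequencies(k).items()))
```

The number of strings grows as $k^k$, so this direct approach is practical only for small $k$.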

Let $|X| = k$ and let $f_k(C)$ denote the frequency of the $k$-length strings over $X$ having complexity $C$. Then the following proposition is true.

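Proposition 4

$$f_k(C) = 0 \quad \text{if } C < k \text{ or } C > \frac{k(k+1)}{2},$$
$$f_k(k) = k,$$
$$f_k(2k-1) = 3k(k-1),$$
$$f_k\!\left(\frac{k(k+1)}{2} - 1\right) = \frac{k(k-1)\,k!}{2},$$
$$f_k\!\left(\frac{k(k+1)}{2}\right) = k!.$$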



Proof. The first two and the last one are evident. Let us prove the third. If the complexity of a $k$-length string is $2k - 1$, then it must contain exactly two substrings of each of the lengths $1, 2, \ldots, k-1$, and only one of length $k$, and it must be formed of two letters. (If it contains 3 letters then the complexity is at least $3k - 3$; see property f).) In this case the 2-length substrings can be only $aa$, $ab$ or $aa$, $ba$ or $ab$, $ba$, and with these only strings of the form $a^{k-1}b$, $ba^{k-1}$ and $(ab)^{k/2}$ (if $k$ is even) or $(ab)^{(k-1)/2}a$ (if $k$ is odd) can be generated. In every case the two letters can be chosen in $k(k-1)$ ways, and because of the three above possibilities $f_k(2k-1) = 3k(k-1)$.

The last but one comes from the following: $k$ letters can form $k!$ different $k$-length strings of maximal complexity, and the complexity of such a string can be diminished by one if we replace a letter by another already present in that string. We can choose a position for one already given in $k(k-1)$ ways, and because of the symmetry of the letters in these positions, the number of new strings is $k!\,k(k-1)/2$. ■

As regards the distribution of the frequency 0, we can prove the following.

Proposition 5
If $C = k+1, k+2, \ldots, 2k-2$, then $f_k(C) = 0$.
If $C = 2k, 2k+1, \ldots, 3k-5$, then $f_k(C) = 0$.

Proof. The complexity of the trivial $k$-length string is $k$, and this contains only one letter $k$ times. If in such a string we replace one or more letters by a new one, the number of substrings of any length, except the whole string, will increase by at least one. Then the complexity will be at least $2k - 1$, and there are no strings with complexity strictly between $k$ and $2k - 1$. To prove the second formula, we use the following assertion, which is easy to see: if a $k$-length string has $n$ $i$-length substrings, then it has at least $\min(n, k-i)$ $(i+1)$-length substrings.

By replacing a letter with a new one in the strings of complexity $2k - 1$, we obtain at least complexity $3k - 3$. If we replace one $a$ (or more) with one $b$ (or more), or inversely, without obtaining a trivial string and keeping the length, the number of 2-length substrings will increase to 3, and by the above assertion so will the number of $3$-, $4$-, $\ldots$, $(k-2)$-length substrings. Then the complexity will be at least $2 + 3(k-3) + 2 + 1$, which is $3k - 4$. ■

Strings of length $k$ may have complexity between $k$ and $k(k+1)/2$. Let us denote by $b_k$ the least number for which

$$f_k(C) \ne 0 \quad \text{for all } C \text{ with } b_k \le C \le \frac{k(k+1)}{2}.$$

The number $b_k$ exists for any $k$ (in the worst case it may be equal to $k(k+1)/2$). In Table 2 we can see that $b_3 = 5$, $b_4 = 7$, $b_5 = 11$ and $b_6 = 14$.

We give the following conjecture:

Conjecture. If $k = \frac{\ell(\ell+1)}{2} + 2 + i$, where $\ell \ge 2$ and $0 \le i \le \ell$, then

$$b_k = \frac{\ell(\ell^2 - 1)}{2} + 3\ell + 2 + i(\ell + 1).$$

We can easily see that $f_k(b_k) \ne 0$ for $k \ge 5$, because of $K(ab^{k-\ell}ab^{\ell-2}) = b_k$.
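The conjectured value of $b_k$ and the witness string $ab^{k-\ell}ab^{\ell-2}$ can be compared numerically. The Python sketch below checks only that the witness string has the conjectured complexity; it says nothing about whether all larger complexities occur. The function names are illustrative.

```python
def complexity(p):
    """Number of distinct nonempty contiguous substrings of p."""
    n = len(p)
    return len({p[i:j] for i in range(n) for j in range(i + 1, n + 1)})


def decompose(k):
    """Return (l, i) with k = l(l+1)/2 + 2 + i, l >= 2 and 0 <= i <= l."""
    if k < 5:
        raise ValueError("the conjecture is stated for k >= 5")
    l = 2
    while (l + 1) * (l + 2) // 2 + 2 <= k:
        l += 1
    return l, k - (l * (l + 1) // 2 + 2)


def conjectured_b(k):
    """Conjectured value of b_k."""
    l, i = decompose(k)
    return l * (l * l - 1) // 2 + 3 * l + 2 + i * (l + 1)


def witness(k):
    """The string a b^(k-l) a b^(l-2), of length k."""
    l, _ = decompose(k)
    return "a" + "b" * (k - l) + "a" + "b" * (l - 2)


print(conjectured_b(5), conjectured_b(6))  # 11 14
```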



Conclusions 

We have studied the $d$-complexity of strings, which is defined as the number of all distinct $d$-substrings of a string. The concept of the $d$-substring is a generalization of that of the substring: not only a contiguous part of a string can be chosen as a substring, but also parts whose consecutive letters lie at a distance no greater than $d$ from each other. The $d$-complexity of strings with different letters only can be computed by a Fibonacci-type sequence. Proposition 1 gives a formula for this complexity in almost all cases.

The 1-complexity is studied in detail. In Propositions 2 and 3 we prove that, except in some cases, a nontrivial string with a given complexity can be associated to any natural number. The frequency of strings with a given complexity is also considered. For strings of length $k$, a formula is conjectured for the least value between $k$ and $k(k+1)/2$ from which on the frequency is never zero.